L11: Dates, lags, sorting

import pandas as pd
import numpy as np

df = pd.read_excel('./rawdata12.xlsx')
df

	firmid	date	return
0	1	09/30/2010	0.050
1	2	10/31/2010	0.450
2	3	11/30/2010	23.000
3	1	11/30/2010	0.870
4	2	11/30/2010	0.200
5	3	12/31/2010	0.340
6	1	12/31/2010	0.060
7	2	12/31/2010	0.001
8	3	12/31/2010	-0.120

Date formats: “pd.to_datetime()” and “dt.to_period()”

Pandas offers a lot of flexibility to manipulate dates and time stamps. Much of this functionality can only be used on columns that have the Pandas “datetime” data type. We can convert dates to this data type using the .to_datetime() function.

First, not that the date column in our dataframe is of type “object”:

df.dtypes

firmid      int64
date       object
return    float64
dtype: object

We’ll create a copy of the df dataframe to avoid changing the original data in df:

df2 = df.copy()

df2['dtdate'] = pd.to_datetime(df2['date'])
df2

	firmid	date	return	dtdate
0	1	09/30/2010	0.050	2010-09-30
1	2	10/31/2010	0.450	2010-10-31
2	3	11/30/2010	23.000	2010-11-30
3	1	11/30/2010	0.870	2010-11-30
4	2	11/30/2010	0.200	2010-11-30
5	3	12/31/2010	0.340	2010-12-31
6	1	12/31/2010	0.060	2010-12-31
7	2	12/31/2010	0.001	2010-12-31
8	3	12/31/2010	-0.120	2010-12-31

df2.dtypes

firmid             int64
date              object
return           float64
dtdate    datetime64[ns]
dtype: object

Now we can apply many useful date functions (they usually have the prefix dt.) to this datetime variable. For example, we can extract information about specific components of the date:

df2['year'] = df2['dtdate'].dt.year
df2['month'] = df2['dtdate'].dt.month
df2['day'] = df2['dtdate'].dt.day
df2

	firmid	date	return	dtdate	year	month	day
0	1	09/30/2010	0.050	2010-09-30	2010	9	30
1	2	10/31/2010	0.450	2010-10-31	2010	10	31
2	3	11/30/2010	23.000	2010-11-30	2010	11	30
3	1	11/30/2010	0.870	2010-11-30	2010	11	30
4	2	11/30/2010	0.200	2010-11-30	2010	11	30
5	3	12/31/2010	0.340	2010-12-31	2010	12	31
6	1	12/31/2010	0.060	2010-12-31	2010	12	31
7	2	12/31/2010	0.001	2010-12-31	2010	12	31
8	3	12/31/2010	-0.120	2010-12-31	2010	12	31

Another common use of the .to_datetime() function is to construct a datetime variable from date components:

df2['newdate'] = pd.to_datetime(df2[['year','month','day']])
df2

	firmid	date	return	dtdate	year	month	day	newdate
0	1	09/30/2010	0.050	2010-09-30	2010	9	30	2010-09-30
1	2	10/31/2010	0.450	2010-10-31	2010	10	31	2010-10-31
2	3	11/30/2010	23.000	2010-11-30	2010	11	30	2010-11-30
3	1	11/30/2010	0.870	2010-11-30	2010	11	30	2010-11-30
4	2	11/30/2010	0.200	2010-11-30	2010	11	30	2010-11-30
5	3	12/31/2010	0.340	2010-12-31	2010	12	31	2010-12-31
6	1	12/31/2010	0.060	2010-12-31	2010	12	31	2010-12-31
7	2	12/31/2010	0.001	2010-12-31	2010	12	31	2010-12-31
8	3	12/31/2010	-0.120	2010-12-31	2010	12	31	2010-12-31

The other, very commonly used type for date data is the Pandas period format. This is used to specify that your data has a particular frequency, and can be done by applying the .to_period() function to a datetime variable (e.g. use ‘Y’ for yearly frequency data, ‘M’ for monthly, and ‘Q’ for quarterly):

df2['yearly'] = df2['dtdate'].dt.to_period('Y')
df2['monthly'] = df2['dtdate'].dt.to_period('M')
df2['quarterly'] = df2['dtdate'].dt.to_period('Q')
df2

	firmid	date	return	dtdate	year	month	day	newdate	yearly	monthly	quarterly
0	1	09/30/2010	0.050	2010-09-30	2010	9	30	2010-09-30	2010	2010-09	2010Q3
1	2	10/31/2010	0.450	2010-10-31	2010	10	31	2010-10-31	2010	2010-10	2010Q4
2	3	11/30/2010	23.000	2010-11-30	2010	11	30	2010-11-30	2010	2010-11	2010Q4
3	1	11/30/2010	0.870	2010-11-30	2010	11	30	2010-11-30	2010	2010-11	2010Q4
4	2	11/30/2010	0.200	2010-11-30	2010	11	30	2010-11-30	2010	2010-11	2010Q4
5	3	12/31/2010	0.340	2010-12-31	2010	12	31	2010-12-31	2010	2010-12	2010Q4
6	1	12/31/2010	0.060	2010-12-31	2010	12	31	2010-12-31	2010	2010-12	2010Q4
7	2	12/31/2010	0.001	2010-12-31	2010	12	31	2010-12-31	2010	2010-12	2010Q4
8	3	12/31/2010	-0.120	2010-12-31	2010	12	31	2010-12-31	2010	2010-12	2010Q4

These type of period date are useful for many operations on the data, the most important one being that Pandas understand what you mean if you want to add or subtract some number of periods from a given date. For example:

df2['lastm'] = df2['monthly'] - 1
df2['nextq'] = df2['quarterly'] + 1
df2

	firmid	date	return	dtdate	year	month	day	newdate	yearly	monthly	quarterly	lastm	nextq
0	1	09/30/2010	0.050	2010-09-30	2010	9	30	2010-09-30	2010	2010-09	2010Q3	2010-08	2010Q4
1	2	10/31/2010	0.450	2010-10-31	2010	10	31	2010-10-31	2010	2010-10	2010Q4	2010-09	2011Q1
2	3	11/30/2010	23.000	2010-11-30	2010	11	30	2010-11-30	2010	2010-11	2010Q4	2010-10	2011Q1
3	1	11/30/2010	0.870	2010-11-30	2010	11	30	2010-11-30	2010	2010-11	2010Q4	2010-10	2011Q1
4	2	11/30/2010	0.200	2010-11-30	2010	11	30	2010-11-30	2010	2010-11	2010Q4	2010-10	2011Q1
5	3	12/31/2010	0.340	2010-12-31	2010	12	31	2010-12-31	2010	2010-12	2010Q4	2010-11	2011Q1
6	1	12/31/2010	0.060	2010-12-31	2010	12	31	2010-12-31	2010	2010-12	2010Q4	2010-11	2011Q1
7	2	12/31/2010	0.001	2010-12-31	2010	12	31	2010-12-31	2010	2010-12	2010Q4	2010-11	2011Q1
8	3	12/31/2010	-0.120	2010-12-31	2010	12	31	2010-12-31	2010	2010-12	2010Q4	2010-11	2011Q1

Sorting

We can sort a dataframe based on the values in a particular column using the .sort_values() function. To sort based on the values in an index, we use the .sort_index() function.

.sort_values()

Syntax:

DataFrame.sort_values(by, axis=0, ascending=True, inplace=False)

df.sort_values('firmid')

	firmid	date	return
0	1	09/30/2010	0.050
3	1	11/30/2010	0.870
6	1	12/31/2010	0.060
1	2	10/31/2010	0.450
4	2	11/30/2010	0.200
7	2	12/31/2010	0.001
2	3	11/30/2010	23.000
5	3	12/31/2010	0.340
8	3	12/31/2010	-0.120

df.sort_values('firmid', ascending = False)

	firmid	date	return
2	3	11/30/2010	23.000
5	3	12/31/2010	0.340
8	3	12/31/2010	-0.120
1	2	10/31/2010	0.450
4	2	11/30/2010	0.200
7	2	12/31/2010	0.001
0	1	09/30/2010	0.050
3	1	11/30/2010	0.870
6	1	12/31/2010	0.060

Remember, functions that have an inplace parameter do not actually change the original dataset unless we set that parameter to True:

df

	firmid	date	return
0	1	09/30/2010	0.050
1	2	10/31/2010	0.450
2	3	11/30/2010	23.000
3	1	11/30/2010	0.870
4	2	11/30/2010	0.200
5	3	12/31/2010	0.340
6	1	12/31/2010	0.060
7	2	12/31/2010	0.001
8	3	12/31/2010	-0.120

df.sort_values('firmid', inplace = True)
df

	firmid	date	return
0	1	09/30/2010	0.050
3	1	11/30/2010	0.870
6	1	12/31/2010	0.060
1	2	10/31/2010	0.450
4	2	11/30/2010	0.200
7	2	12/31/2010	0.001
2	3	11/30/2010	23.000
5	3	12/31/2010	0.340
8	3	12/31/2010	-0.120

.sort_index()

Abbreviated syntax:

DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False)

df.sort_index()

	firmid	date	return
0	1	09/30/2010	0.050
1	2	10/31/2010	0.450
2	3	11/30/2010	23.000
3	1	11/30/2010	0.870
4	2	11/30/2010	0.200
5	3	12/31/2010	0.340
6	1	12/31/2010	0.060
7	2	12/31/2010	0.001
8	3	12/31/2010	-0.120

Lagging and leading

In the context of most financial datasets which contain time-indexed information, lagging a particular variable (column) means obtaining the values for that particular variable from a prior point in time. Leading a variable means obtaining values from a future point in time.

If you research how to lead and lag variables in with pandas dataframes, most sources (including the official Pandas user guide) claims that you can do this using the .shift() function. In this section I first show you how to use this method, and when it runs into problems. Then show you how you can create leads and lags in a more robust way (using period dates and the merge function).

Lagging and leading using `.shift()`

Syntax:

DataFrame.shift(periods=1, freq=None, axis=0, fill_value=NoDefault.no_default)

Suppose we want to create a new column lag_return which tells us, for each firm, its returns from the prior month. The general advice you’ll see is that you should first sort your dataframe by firm identifier and by date:

df2 = df.sort_values(['firmid','date']).copy()
df2

	firmid	date	return
0	1	09/30/2010	0.050
3	1	11/30/2010	0.870
6	1	12/31/2010	0.060
1	2	10/31/2010	0.450
4	2	11/30/2010	0.200
7	2	12/31/2010	0.001
2	3	11/30/2010	23.000
5	3	12/31/2010	0.340
8	3	12/31/2010	-0.120

And then use the .shift() function, after you tell Python that your dates are grouped at the firm level (each firm identifier has its own set of dates):

df2['lag_ret'] = df2.groupby('firmid')['return'].shift(1)
df2

	firmid	date	return	lag_ret
0	1	09/30/2010	0.050	NaN
3	1	11/30/2010	0.870	0.05
6	1	12/31/2010	0.060	0.87
1	2	10/31/2010	0.450	NaN
4	2	11/30/2010	0.200	0.45
7	2	12/31/2010	0.001	0.20
2	3	11/30/2010	23.000	NaN
5	3	12/31/2010	0.340	23.00
8	3	12/31/2010	-0.120	0.34

Note that the entries for lag_ret on the second row and the last row are not correct: they do not tell us what the return of the firm was in the prior month. This happens because

Our data has gaps in coverage (October 2010 is missing for firmid==1)
Our data has duplicates (there are two entries for December 2010 for firmid==3)

Note that both of these issues disappear if we first get rid of duplicates and if we reinterpret “lagging” to mean “the last available data point” not “data from the last calendar period” and “leading” to mean “the next available data point” and not “data from the following calendar period”. To keep things simple, this is the approach we will take in this course.

However, if this reinterpretation of lagging and leading is not exactly what you need for your application and you need to lag and lead in terms of calendar periods, you should follow the approach below:

Leading and lagging with `.merge()`

First, again, create a copy of the original dataset:

df2 = df.sort_values(['firmid','date']).copy()
df2

	firmid	date	return
0	1	09/30/2010	0.050
3	1	11/30/2010	0.870
6	1	12/31/2010	0.060
1	2	10/31/2010	0.450
4	2	11/30/2010	0.200
7	2	12/31/2010	0.001
2	3	11/30/2010	23.000
5	3	12/31/2010	0.340
8	3	12/31/2010	-0.120

We will first create a new date variable that tells Python the frequency of our dates:

df2['mdate'] = pd.to_datetime(df2['date']).dt.to_period('M')
df2

	firmid	date	return	mdate
0	1	09/30/2010	0.050	2010-09
3	1	11/30/2010	0.870	2010-11
6	1	12/31/2010	0.060	2010-12
1	2	10/31/2010	0.450	2010-10
4	2	11/30/2010	0.200	2010-11
7	2	12/31/2010	0.001	2010-12
2	3	11/30/2010	23.000	2010-11
5	3	12/31/2010	0.340	2010-12
8	3	12/31/2010	-0.120	2010-12

Now create a new dataframe containing the firm identifiers (firmid), the period date (mdate) and the variable we want to lag (return):

lags = df2[['firmid','mdate','return']].copy()
lags

	firmid	mdate	return
0	1	2010-09	0.050
3	1	2010-11	0.870
6	1	2010-12	0.060
1	2	2010-10	0.450
4	2	2010-11	0.200
7	2	2010-12	0.001
2	3	2010-11	23.000
5	3	2010-12	0.340
8	3	2010-12	-0.120

Now add (subtract in case of leads) the number of periods you want to lag the return variable (1 in our example) to the period date:

lags['mdate'] = lags['mdate'] + 1
lags

	firmid	mdate	return
0	1	2010-10	0.050
3	1	2010-12	0.870
6	1	2011-01	0.060
1	2	2010-11	0.450
4	2	2010-12	0.200
7	2	2011-01	0.001
2	3	2010-12	23.000
5	3	2011-01	0.340
8	3	2011-01	-0.120

And rename return to lag_return:

lags = lags.rename(columns={'return':'lag_return'})
lags

	firmid	mdate	lag_return
0	1	2010-10	0.050
3	1	2010-12	0.870
6	1	2011-01	0.060
1	2	2010-11	0.450
4	2	2010-12	0.200
7	2	2011-01	0.001
2	3	2010-12	23.000
5	3	2011-01	0.340
8	3	2011-01	-0.120

Finally, merge this lagged data into the original dataset:

df2 = df2.merge(lags, how='left', on=['firmid','mdate'])

df2.sort_values(['firmid','mdate'])

	firmid	date	return	mdate	lag_return
0	1	09/30/2010	0.050	2010-09	NaN
1	1	11/30/2010	0.870	2010-11	NaN
2	1	12/31/2010	0.060	2010-12	0.87
3	2	10/31/2010	0.450	2010-10	NaN
4	2	11/30/2010	0.200	2010-11	0.45
5	2	12/31/2010	0.001	2010-12	0.20
6	3	11/30/2010	23.000	2010-11	NaN
7	3	12/31/2010	0.340	2010-12	23.00
8	3	12/31/2010	-0.120	2010-12	23.00

Note that now the second and last entry in lag_return are correct.

You still have to contend with what it means that you have duplicate entries for return for December 2010 for firmid==3 but dealing with duplicates needs to be addressed on a case by case basis, depending on the particulars of the data you are using.